AITopics | hazardous knowledge

Collaborating Authors

hazardous knowledge

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

An Adversarial Perspective on Machine Unlearning for AI Safety

Łucki, Jakub, Wei, Boyi, Huang, Yangsibo, Henderson, Peter, Tramèr, Florian, Rando, Javier

arXiv.org Artificial IntelligenceNov-8-2024

Large language models are finetuned to refuse questions about hazardous knowledge, but these protections can often be bypassed. Unlearning methods aim at completely removing hazardous capabilities from models and make them inaccessible to adversaries. This work challenges the fundamental differences between unlearning and traditional safety post-training from an adversarial perspective. We demonstrate that existing jailbreak methods, previously reported as ineffective against unlearning, can be successful when applied carefully. Furthermore, we develop a variety of adaptive methods that recover most supposedly unlearned capabilities. For instance, we show that finetuning on 10 unrelated examples or removing specific directions in the activation space can recover most hazardous capabilities for models edited with RMU, a state-of-the-art unlearning method. Our findings challenge the robustness of current unlearning approaches and question their advantages over safety training.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2409.18025

Country:

North America > United States > Virginia (0.04)
Europe > United Kingdom (0.04)
Europe > Switzerland > Zürich > Zürich (0.04)
Europe > Latvia > Lubāna Municipality > Lubāna (0.04)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine > Therapeutic Area > Oncology (0.92)
Health & Medicine > Therapeutic Area > Hematology (0.92)
Health & Medicine > Pharmaceuticals & Biotechnology (0.92)
Information Technology > Security & Privacy (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback

Split, Unlearn, Merge: Leveraging Data Attributes for More Effective Unlearning in LLMs

Kadhe, Swanand Ravindra, Ahmed, Farhan, Wei, Dennis, Baracaldo, Nathalie, Padhi, Inkit

arXiv.org Artificial IntelligenceJun-17-2024

Large language models (LLMs) have shown to pose social and ethical risks such as generating toxic language or facilitating malicious use of hazardous knowledge. Machine unlearning is a promising approach to improve LLM safety by directly removing harmful behaviors and knowledge. In this paper, we propose "SPlit, UNlearn, MerGE" (SPUNGE), a framework that can be used with any unlearning method to amplify its effectiveness. SPUNGE leverages data attributes during unlearning by splitting unlearning data into subsets based on specific attribute values, unlearning each subset separately, and merging the unlearned models. We empirically demonstrate that SPUNGE significantly improves the performance of two recent unlearning methods on state-of-the-art LLMs while maintaining their general capabilities on standard academic benchmarks.

knowledge, punge, toxicity, (16 more...)

arXiv.org Artificial Intelligence

2406.1178

Country:

South America > Colombia > Meta Department > Villavicencio (0.04)
North America > Canada > Ontario > Toronto (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
(6 more...)

Genre: Research Report > Promising Solution (0.34)

Industry: Information Technology (0.47)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

Li, Nathaniel, Pan, Alexander, Gopal, Anjali, Yue, Summer, Berrios, Daniel, Gatti, Alice, Li, Justin D., Dombrowski, Ann-Kathrin, Goel, Shashwat, Phan, Long, Mukobi, Gabriel, Helm-Burger, Nathan, Lababidi, Rassin, Justen, Lennart, Liu, Andrew B., Chen, Michael, Barrass, Isabelle, Zhang, Oliver, Zhu, Xiaoyuan, Tamirisa, Rishub, Bharathi, Bhrugu, Khoja, Adam, Zhao, Zhenqi, Herbert-Voss, Ariel, Breuer, Cort B., Marks, Samuel, Patel, Oam, Zou, Andy, Mazeika, Mantas, Wang, Zifan, Oswal, Palash, Lin, Weiran, Hunt, Adam A., Tienken-Harder, Justin, Shih, Kevin Y., Talley, Kemper, Guan, John, Kaplan, Russell, Steneker, Ian, Campbell, David, Jokubaitis, Brad, Levinson, Alex, Wang, Jean, Qian, William, Karmakar, Kallol Krishna, Basart, Steven, Fitz, Stephen, Levine, Mindy, Kumaraguru, Ponnurangam, Tupakula, Uday, Varadharajan, Vijay, Wang, Ruoyu, Shoshitaishvili, Yan, Ba, Jimmy, Esvelt, Kevin M., Wang, Alexandr, Hendrycks, Dan

arXiv.org Artificial IntelligenceMay-15-2024

The White House Executive Order on Artificial Intelligence highlights the risks of large language models (LLMs) empowering malicious actors in developing biological, cyber, and chemical weapons. To measure these risks of malicious use, government institutions and major AI labs are developing evaluations for hazardous capabilities in LLMs. However, current evaluations are private, preventing further research into mitigating risk. Furthermore, they focus on only a few, highly specific pathways for malicious use. To fill these gaps, we publicly release the Weapons of Mass Destruction Proxy (WMDP) benchmark, a dataset of 3,668 multiple-choice questions that serve as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security. WMDP was developed by a consortium of academics and technical consultants, and was stringently filtered to eliminate sensitive information prior to public release. WMDP serves two roles: first, as an evaluation for hazardous knowledge in LLMs, and second, as a benchmark for unlearning methods to remove such hazardous knowledge. To guide progress on unlearning, we develop RMU, a state-of-the-art unlearning method based on controlling model representations. RMU reduces model performance on WMDP while maintaining general capabilities in areas such as biology and computer science, suggesting that unlearning may be a concrete path towards reducing malicious use from LLMs. We release our benchmark and code publicly at https://wmdp.ai

hazardous knowledge, information, knowledge, (15 more...)

arXiv.org Artificial Intelligence

2403.03218

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
North America > United States > New York (0.04)
North America > United States > Massachusetts (0.04)
(6 more...)

Genre: Research Report > Promising Solution (0.67)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine > Therapeutic Area (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

Researchers Develop New Technique to Wipe Dangerous Knowledge From AI Systems

TIME - TechMar-6-2024, 18:16:22 GMT

A study published Tuesday provides a newly-developed way to measure whether an AI model contains potentially hazardous knowledge, along with a technique for removing the knowledge from an AI system while leaving the rest of the model relatively intact. Together, the findings could help prevent AI models from being used to carry out cyberattacks and deploy bioweapons. The study was conducted by researchers from Scale AI, an AI training data provider, and the Center for AI Safety, a nonprofit, along with a consortium of more than 20 experts in biosecurity, chemical weapons, and cybersecurity. The subject matter experts generated a set of questions that, taken together, could assess whether an AI model can assist in efforts to create and deploy weapons of mass destruction. The researchers from the Center for AI Safety, building on previous work that helps to understand how AI models represent concepts, developed the "mind wipe" technique.

ai model, ai safety, knowledge, (11 more...)

TIME - Tech

Country: North America > United States (0.29)

Genre: Research Report > New Finding (0.88)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Military > Cyberwarfare (0.58)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.49)
Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (0.48)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.31)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.31)

Add feedback